library(dplyr)
library(MASS)
library(imputeTS)
library(rpart)
library(glmnet)
library(zoo)
library(formattable)
library(kableExtra)
library(chron)
library(tidyverse)
library(caret)
library(rpart.plot)
library(randomForest)
library(lmtest)
library(gridExtra)
library(xgboost)
library(faraway)
library(nlme)
library(corrplot)
The main objective of this project is to predict the sales of 111 potentially weather-sensitive products (such as umbrellas, bread, and milk) around the time of major weather events at 45 of Walmart's retail locations, using statistical and machine learning techniques. This is an important problem for any big retailer, as it helps the replenishment manager correctly plan and manage inventory to avoid being out of stock or overstocked during and after a storm. It also enables timely intervention to manage risks arising from other unforeseen circumstances.
Walmart is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962 and incorporated on October 31, 1969. It also owns and operates Sam's Club retail warehouses. As of October 31, 2019, Walmart has 11,438 stores and clubs in 27 countries, operating under 55 different names. The company operates under the name Walmart in the United States and Canada, as Walmart de México y Centroamérica in Mexico and Central America, as Asda in the United Kingdom, as the Seiyu Group in Japan, and as Best Price in India. It has wholly owned operations in Argentina, Chile, Canada, and South Africa. Since August 2018, Walmart only holds a minority stake in Walmart Brasil, which was renamed Grupo Big in August 2019, with 20 percent of the company's shares, and private equity firm Advent International holding 80 percent ownership of the company.
Inventory Management: Inventory management is a tricky function of any business, and especially for consumer-facing businesses, managing the inventory of different stock-keeping units (SKUs) at an optimal quantity becomes one of the critical parts of the operation. Anything above the optimal value creates overhead costs that drag down profitability, and anything below it can cause consumers to leave the store and buy elsewhere. Hence, it becomes a key function for any consumer-facing business to maintain an optimal inventory for each SKU. The problem of inventory management boils down to one important metric: the demand for a particular item, which is difficult to predict without a scientific method and is also time consuming. Therefore, the goal for a firm is to predict the demand for its SKUs and maintain inventory accordingly.
Problem faced by Walmart: With more than 11,000 stores across different geographies, inventory management becomes a necessary task for Walmart to focus on its customers' needs and to make sure the right products are available at the right place in the right quantity. The prediction problem at hand concerns a subset of SKUs, weather-sensitive items like umbrellas, bread, etc., around the time of major weather events, which makes the demand even harder to predict. Therefore, the task is to use analytical techniques to predict the demand for these essential SKUs during major weather events.
The data has been downloaded from the Kaggle website. The dataset contains sales data for 111 products whose sales may be affected by weather (such as milk, bread, umbrellas, etc.). These 111 products are sold in stores at 45 different Walmart locations. Some of the products may be a similar item (such as milk) but have a different id in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e., some of the stores are nearby and share a weather station).
For the purposes of this project, a weather event is defined as any day in which more than an inch of rain or two inches of snow was observed. The prediction is the units sold for a window of ±3 days surrounding each storm.
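This event definition can be sketched directly in R. The column names below (`preciptotal`, `snowfall`) follow the fields of the weather file, but the toy data frame itself is purely illustrative:

```r
# Toy station-level weather data (illustrative values, one station)
weather <- data.frame(
  date        = as.Date("2013-04-01") + 0:9,
  preciptotal = c(0, 1.2, 0, 0, 0, 0, 0, 2.5, 0, 0),  # inches of rain
  snowfall    = c(0, 0,   0, 0, 3, 0, 0, 0,   0, 0)   # inches of snow
)

# A weather event: more than 1" of rain or more than 2" of snow on a day
weather$event <- weather$preciptotal > 1 | weather$snowfall > 2

# Flag every day falling in the ±3-day prediction window around an event
event_dates <- weather$date[weather$event]
weather$in_window <- sapply(weather$date, function(d)
  any(abs(as.numeric(d - event_dates)) <= 3))
```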
The following graphic shows the layout of the test windows. The green dots are the training set days, the red dots are the test set days, and days marked event=True are the days with storms. Note that this plot covers the 20 weather stations. All days prior to 2013-04-01 are given out as training data.
The descriptions of the data files and the field descriptions are given below.
Field descriptions

- date - the day of sales or weather
- store_nbr - an id representing one of the 45 stores
- station_nbr - an id representing one of 20 weather stations
- item_nbr - an id representing one of the 111 products
- units - the quantity sold of an item on a given day
- id - a triplet representing a store_nbr, item_nbr, and date. Form the id by concatenating these (in that order) with an underscore. E.g. "2_1_2013-04-01" represents store 2, item 1, sold on 2013-04-01.

File descriptions

- key.csv - the relational mapping between stores and the weather stations that cover them
- sampleSubmission.csv - file that gives the prediction format
- train.csv - sales data for all stores & dates in the training set
- test.csv - stores & dates for forecasting (missing 'units' that I have to predict)
- weather.csv - a file containing the NOAA weather information for each station and day
- noaa_weather_qclcd_documentation.pdf - a guide to understand the data provided in the weather.csv file

The data from raw csv files is initially loaded into R dataframes.
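The id construction described above is plain string concatenation; a minimal helper (the name `make_id` is ours) might look like:

```r
# Build the submission id "store_item_date" by joining the triplet with "_"
make_id <- function(store_nbr, item_nbr, date) {
  paste(store_nbr, item_nbr, date, sep = "_")
}

make_id(2, 1, "2013-04-01")  # "2_1_2013-04-01"
```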
On early examination of the train dataset, we observed that a large number of store/item combinations had no sales recorded at all. Refer to the image below:
With no past sales data for these combinations, it makes sense to remove them from the training set. On encountering such a combination in our test set, we predict the sales to be zero.
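A sketch of this filtering with dplyr (loaded at the top of this report); the toy frame stands in for the real train data:

```r
library(dplyr)

# Toy training data (assumed shape: store_nbr, item_nbr, units)
train <- data.frame(
  store_nbr = c(1, 1, 2, 2),
  item_nbr  = c(1, 2, 1, 2),
  units     = c(5, 0, 0, 0)
)

# Store/item pairs whose total sales are zero carry no signal: drop them
# from training and predict zero for them at test time.
zero_combos <- train %>%
  group_by(store_nbr, item_nbr) %>%
  summarise(total = sum(units), .groups = "drop") %>%
  filter(total == 0)

train_filtered <- train %>%
  anti_join(zero_combos, by = c("store_nbr", "item_nbr"))
```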
As discussed above, we first filter out store and item number combinations with zero sales from the training dataset. Using the attributes store_nbr and item_nbr, the key dataframe is joined to the train dataframe. The resulting dataframe is then joined to the weather dataframe using the attributes station_nbr and date. This join links the weather data recorded by the weather station in a particular store's region to the corresponding sales data for a given date. A similar join was performed on the test dataset to get the required features for prediction. The R function merge was used to implement these joins.
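A minimal sketch of the two merges with base R's merge, on toy one-row tables:

```r
# Toy versions of the three tables (illustrative values only)
train   <- data.frame(date = "2013-04-01", store_nbr = 2, item_nbr = 1, units = 7)
key     <- data.frame(store_nbr = 2, station_nbr = 14)
weather <- data.frame(date = "2013-04-01", station_nbr = 14, tmax = 52)

# First attach each store's station, then attach that station's weather
train_full <- merge(merge(train, key, by = "store_nbr"),
                    weather, by = c("station_nbr", "date"))
```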
Various important features that impact sales can be extracted from the date attribute. Some examples include day of the week, month, year, etc. Below is a description of the new columns generated from Date:
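These columns can be derived from the date with base R; the names below mirror the predictors that appear in the model summary (weekdays, month, year, day, is_weekend):

```r
d <- as.Date(c("2013-04-01", "2013-12-25"))

date_features <- data.frame(
  date       = d,
  weekdays   = weekdays(d),                        # day of the week
  month      = format(d, "%m"),                    # month of the year
  year       = format(d, "%y"),                    # two-digit year (cf. year13, year14)
  day        = format(d, "%d"),                    # day of the month
  is_weekend = weekdays(d) %in% c("Saturday", "Sunday")
)
```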
Below are some plots to study the relationship between different weather attributes and units sold:
It can be observed from the graph that the amount of snowfall received and units sold have an inverse relationship. This makes sense, since people generally avoid going outdoors to shop on snowy days. When there is no snowfall, the number of items sold is highest.
Precipitation covers rain, snow, sleet, freezing rain, and hail. While plotting, we removed outliers by dropping datapoints exceeding 1.5 times the interquartile range. Precipitation shows a trend similar to snowfall with respect to units sold: on a clear sunny day, the sale of items is at its maximum. Since precipitation accounts for many other adverse weather conditions along with snowfall, it makes sense to include precipitation instead of snowfall.
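The 1.5 × IQR rule used before plotting can be written as a small helper (the name `remove_outliers` is ours):

```r
# Keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
remove_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x[x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr]
}

remove_outliers(c(1, 2, 3, 4, 100))  # drops the outlier 100
```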
The tmax attribute gives the daily maximum temperature. Looking at the graph, we can conclude that the sale of items is lower at extreme temperatures, while days with temperatures between 40 and 90 degrees recorded high sales.
The sealevel attribute provides the atmospheric pressure adjusted to sea level. Pressures around 30 show the maximum item sales.
1. Items sold per store
The first graph shows the total sales per store, grouped by year. It can be observed that item sales decreased between 2012 and 2014.
2. Items sold per month
The second graph shows the total sales per month. Sales are higher between November and February, the festive season. For the rest of the year, sales remain fairly constant. A small rise can be observed during August, which also happens to be the beginning of the academic term for students.
3. Items sold per day of the month
The third graph shows the variation of sales with the day of the month. Sales are higher at the beginning of the month and fall drastically as the end of the month approaches. This makes sense logically: salaries are credited at the beginning of the month, increasing people's purchasing power, which gradually decreases as the month ends.
4. Items sold per day of the week
The last graph shows the variation of sales with the day of the week. Sales are low on weekdays and increase on Fridays and weekends, which again makes sense.
## Correlation Heatmap
The figure above shows the correlation between the available weather parameters and units sold. Blue tiles show positive correlation, red tiles negative correlation, and the brightness of a tile indicates the strength of the correlation. It can be observed that unit sales have a very weak correlation with the weather parameters, meaning that most of the weather parameters don't have a strong impact on item sales.
Our baseline model contains all the information regarding weather, item, store, etc., except date, station_nbr, and is_weekend, because these details are available via other predictors in the model. From the table above we can see that this model is not good, for multiple reasons. Various predictors such as \(dewpoint\), \(sealevel\), \(resultdir\), etc. are not significant, that is, their \(p\)-value is greater than \(0.05\). The Kaggle score of our baseline model comes out to be around \(0.51477\), which indicates flaws in this model.
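The baseline is an ordinary least squares fit; a minimal sketch on synthetic data, assuming the log(units + 1) response that appears in the diagnostics later (the real model uses the full merged training frame with all predictors above):

```r
# Synthetic stand-in for the merged training data
set.seed(1)
toy <- data.frame(
  units       = rpois(100, 5),
  tmax        = rnorm(100, 70, 10),
  preciptotal = rexp(100, 10),
  sealevel    = rnorm(100, 30, 0.2)
)

# OLS on the log-transformed response, as in the full baseline
baseline <- lm(log(units + 1) ~ tmax + preciptotal + sealevel, data = toy)
coef(summary(baseline))  # Estimate, Std. Error, t value, Pr(>|t|)
```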
Below is the summary of the regression model:
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2.20257 | 0.27217 | 8.09260 | 0.00000 |
| store_nbr2 | 0.30477 | 0.02127 | 14.32540 | 0.00000 |
| store_nbr3 | 0.13221 | 0.02089 | 6.32764 | 0.00000 |
| store_nbr4 | 0.57779 | 0.01967 | 29.38090 | 0.00000 |
| store_nbr5 | 0.06247 | 0.01723 | 3.62580 | 0.00029 |
| store_nbr6 | 0.22417 | 0.02330 | 9.62077 | 0.00000 |
| store_nbr7 | 0.39731 | 0.01888 | 21.04921 | 0.00000 |
| store_nbr8 | -0.02076 | 0.01836 | -1.13086 | 0.25812 |
| store_nbr9 | 0.13106 | 0.01732 | 7.56772 | 0.00000 |
| store_nbr10 | 0.25160 | 0.01661 | 15.14318 | 0.00000 |
| store_nbr11 | 0.13741 | 0.01742 | 7.88604 | 0.00000 |
| store_nbr12 | 0.26709 | 0.01699 | 15.71624 | 0.00000 |
| store_nbr13 | 0.25389 | 0.01773 | 14.31725 | 0.00000 |
| store_nbr14 | -0.55532 | 0.01566 | -35.46505 | 0.00000 |
| store_nbr15 | 0.21512 | 0.05994 | 3.58914 | 0.00033 |
| store_nbr16 | 0.28407 | 0.01896 | 14.98362 | 0.00000 |
| store_nbr17 | 1.39480 | 0.02053 | 67.93946 | 0.00000 |
| store_nbr18 | -0.51856 | 0.02064 | -25.12818 | 0.00000 |
| store_nbr19 | -0.08134 | 0.01689 | -4.81543 | 0.00000 |
| store_nbr20 | 0.26403 | 0.02174 | 12.14472 | 0.00000 |
| store_nbr21 | -0.01301 | 0.01981 | -0.65686 | 0.51127 |
| store_nbr22 | 0.23901 | 0.01704 | 14.02343 | 0.00000 |
| store_nbr23 | -0.30706 | 0.01944 | -15.79404 | 0.00000 |
| store_nbr24 | 0.59487 | 0.03424 | 17.37135 | 0.00000 |
| store_nbr25 | 0.28383 | 0.06009 | 4.72346 | 0.00000 |
| store_nbr26 | -0.01205 | 0.01905 | -0.63237 | 0.52714 |
| store_nbr27 | 0.13359 | 0.01735 | 7.70078 | 0.00000 |
| store_nbr28 | -0.16817 | 0.02243 | -7.49795 | 0.00000 |
| store_nbr29 | -0.07319 | 0.02095 | -3.49387 | 0.00048 |
| store_nbr30 | 0.19924 | 0.01776 | 11.21805 | 0.00000 |
| store_nbr31 | 0.25477 | 0.01683 | 15.13558 | 0.00000 |
| store_nbr32 | 0.06291 | 0.05991 | 1.05005 | 0.29369 |
| store_nbr33 | 0.54419 | 0.01773 | 30.70151 | 0.00000 |
| store_nbr34 | 0.27759 | 0.01690 | 16.42893 | 0.00000 |
| store_nbr35 | 0.06704 | 0.01746 | 3.84044 | 0.00012 |
| store_nbr36 | -0.23342 | 0.02008 | -11.62519 | 0.00000 |
| store_nbr37 | 0.02381 | 0.06017 | 0.39572 | 0.69231 |
| store_nbr38 | 0.29687 | 0.02023 | 14.67635 | 0.00000 |
| store_nbr39 | -2.03570 | 0.02111 | -96.45544 | 0.00000 |
| store_nbr40 | 0.06149 | 0.05992 | 1.02620 | 0.30480 |
| store_nbr41 | 0.22222 | 0.01659 | 13.39459 | 0.00000 |
| store_nbr42 | -0.02294 | 0.02363 | -0.97070 | 0.33170 |
| store_nbr43 | 0.15203 | 0.01710 | 8.88940 | 0.00000 |
| store_nbr44 | 0.17443 | 0.01708 | 10.21076 | 0.00000 |
| store_nbr45 | -0.62334 | 0.01951 | -31.95673 | 0.00000 |
| item_nbr2 | -0.13913 | 0.02897 | -4.80238 | 0.00000 |
| item_nbr3 | -0.41859 | 0.03422 | -12.23095 | 0.00000 |
| item_nbr4 | 0.19276 | 0.03524 | 5.46912 | 0.00000 |
| item_nbr5 | 3.03823 | 0.02574 | 118.02733 | 0.00000 |
| item_nbr6 | 2.28120 | 0.04470 | 51.02812 | 0.00000 |
| item_nbr7 | -0.60533 | 0.03488 | -17.35576 | 0.00000 |
| item_nbr8 | 2.77569 | 0.03591 | 77.29613 | 0.00000 |
| item_nbr9 | 2.93084 | 0.02501 | 117.18075 | 0.00000 |
| item_nbr10 | -0.34975 | 0.03379 | -10.35116 | 0.00000 |
| item_nbr11 | -0.53993 | 0.03399 | -15.88507 | 0.00000 |
| item_nbr12 | -0.59794 | 0.03295 | -18.14718 | 0.00000 |
| item_nbr13 | -0.51168 | 0.03425 | -14.94154 | 0.00000 |
| item_nbr14 | -0.65384 | 0.03572 | -18.30486 | 0.00000 |
| item_nbr15 | -0.32652 | 0.02702 | -12.08455 | 0.00000 |
| item_nbr16 | 2.59867 | 0.02649 | 98.08151 | 0.00000 |
| item_nbr17 | -0.44682 | 0.03272 | -13.65637 | 0.00000 |
| item_nbr18 | -0.52259 | 0.03475 | -15.03924 | 0.00000 |
| item_nbr19 | -0.35230 | 0.03288 | -10.71612 | 0.00000 |
| item_nbr20 | -0.13591 | 0.03442 | -3.94879 | 0.00008 |
| item_nbr21 | -0.29554 | 0.03010 | -9.81776 | 0.00000 |
| item_nbr22 | 0.08940 | 0.03524 | 2.53646 | 0.01120 |
| item_nbr23 | 2.04034 | 0.03053 | 66.82620 | 0.00000 |
| item_nbr24 | 0.03815 | 0.03439 | 1.10937 | 0.26727 |
| item_nbr25 | 3.18516 | 0.02988 | 106.58792 | 0.00000 |
| item_nbr26 | 1.31514 | 0.03524 | 37.31467 | 0.00000 |
| item_nbr27 | 1.16641 | 0.03482 | 33.49651 | 0.00000 |
| item_nbr28 | 1.05815 | 0.03429 | 30.85946 | 0.00000 |
| item_nbr29 | 0.18132 | 0.03489 | 5.19766 | 0.00000 |
| item_nbr30 | 0.22079 | 0.02798 | 7.89202 | 0.00000 |
| item_nbr31 | -0.55337 | 0.03353 | -16.50575 | 0.00000 |
| item_nbr32 | -0.34188 | 0.03442 | -9.93269 | 0.00000 |
| item_nbr33 | 0.14312 | 0.03339 | 4.28649 | 0.00002 |
| item_nbr34 | 0.05270 | 0.03524 | 1.49532 | 0.13483 |
| item_nbr35 | -0.00690 | 0.03509 | -0.19679 | 0.84399 |
| item_nbr36 | 3.58904 | 0.03074 | 116.74718 | 0.00000 |
| item_nbr37 | 2.00968 | 0.02714 | 74.05959 | 0.00000 |
| item_nbr38 | -0.08033 | 0.03313 | -2.42471 | 0.01532 |
| item_nbr39 | -0.62877 | 0.03025 | -20.78839 | 0.00000 |
| item_nbr40 | -0.49554 | 0.03429 | -14.45163 | 0.00000 |
| item_nbr41 | 2.50208 | 0.03104 | 80.61678 | 0.00000 |
| item_nbr42 | 0.39355 | 0.03379 | 11.64731 | 0.00000 |
| item_nbr43 | 2.56430 | 0.04470 | 57.36090 | 0.00000 |
| item_nbr44 | 3.93058 | 0.02501 | 157.15252 | 0.00000 |
| item_nbr45 | 3.43035 | 0.02582 | 132.84210 | 0.00000 |
| item_nbr46 | -0.49770 | 0.03423 | -14.53808 | 0.00000 |
| item_nbr47 | 0.03062 | 0.03429 | 0.89297 | 0.37188 |
| item_nbr48 | 0.58517 | 0.03534 | 16.55667 | 0.00000 |
| item_nbr49 | -0.67579 | 0.02823 | -23.93665 | 0.00000 |
| item_nbr50 | -0.34116 | 0.02717 | -12.55818 | 0.00000 |
| item_nbr51 | -0.51514 | 0.02737 | -18.82051 | 0.00000 |
| item_nbr52 | -0.31619 | 0.02978 | -10.61885 | 0.00000 |
| item_nbr53 | -0.32462 | 0.03313 | -9.79843 | 0.00000 |
| item_nbr54 | -0.54320 | 0.03414 | -15.91107 | 0.00000 |
| item_nbr55 | -0.23454 | 0.03292 | -7.12407 | 0.00000 |
| item_nbr56 | -0.02984 | 0.03119 | -0.95673 | 0.33871 |
| item_nbr57 | 0.16612 | 0.03555 | 4.67295 | 0.00000 |
| item_nbr58 | -0.25545 | 0.03288 | -7.77002 | 0.00000 |
| item_nbr59 | -0.08262 | 0.03438 | -2.40352 | 0.01624 |
| item_nbr60 | -0.79699 | 0.04470 | -17.82779 | 0.00000 |
| item_nbr61 | -0.43536 | 0.02838 | -15.34009 | 0.00000 |
| item_nbr62 | -0.45956 | 0.03433 | -13.38536 | 0.00000 |
| item_nbr63 | -0.07799 | 0.03439 | -2.26793 | 0.02333 |
| item_nbr64 | -0.59742 | 0.03426 | -17.43706 | 0.00000 |
| item_nbr65 | -0.49318 | 0.03425 | -14.40136 | 0.00000 |
| item_nbr66 | 2.10128 | 0.03439 | 61.10148 | 0.00000 |
| item_nbr67 | 0.66335 | 0.03349 | 19.80562 | 0.00000 |
| item_nbr68 | 2.75143 | 0.02726 | 100.91822 | 0.00000 |
| item_nbr69 | -0.00262 | 0.03443 | -0.07599 | 0.93943 |
| item_nbr70 | 0.17486 | 0.03339 | 5.23712 | 0.00000 |
| item_nbr71 | -1.73917 | 0.03534 | -49.20773 | 0.00000 |
| item_nbr72 | 0.22679 | 0.03555 | 6.37956 | 0.00000 |
| item_nbr73 | -0.47817 | 0.03422 | -13.97196 | 0.00000 |
| item_nbr74 | -0.47015 | 0.03475 | -13.53006 | 0.00000 |
| item_nbr75 | -0.24821 | 0.03443 | -7.20895 | 0.00000 |
| item_nbr76 | -1.04509 | 0.04470 | -23.37770 | 0.00000 |
| item_nbr77 | -0.51610 | 0.03426 | -15.06336 | 0.00000 |
| item_nbr78 | -0.51037 | 0.03414 | -14.94954 | 0.00000 |
| item_nbr79 | -0.49735 | 0.03418 | -14.55308 | 0.00000 |
| item_nbr80 | -0.49990 | 0.03430 | -14.57624 | 0.00000 |
| item_nbr81 | 0.03478 | 0.03295 | 1.05547 | 0.29121 |
| item_nbr82 | -0.28996 | 0.03602 | -8.05069 | 0.00000 |
| item_nbr83 | 1.70898 | 0.03443 | 49.63584 | 0.00000 |
| item_nbr84 | -0.74336 | 0.02830 | -26.26929 | 0.00000 |
| item_nbr85 | -0.94493 | 0.02816 | -33.55775 | 0.00000 |
| item_nbr86 | -0.74029 | 0.02656 | -27.87227 | 0.00000 |
| item_nbr87 | -0.20620 | 0.03431 | -6.00937 | 0.00000 |
| item_nbr88 | -0.23080 | 0.02997 | -7.70050 | 0.00000 |
| item_nbr89 | -0.53845 | 0.03429 | -15.70314 | 0.00000 |
| item_nbr90 | -0.62211 | 0.03482 | -17.86558 | 0.00000 |
| item_nbr91 | -0.26371 | 0.03447 | -7.65001 | 0.00000 |
| item_nbr92 | -0.77514 | 0.03352 | -23.12829 | 0.00000 |
| item_nbr93 | -0.50973 | 0.02539 | -20.07513 | 0.00000 |
| item_nbr94 | -0.30198 | 0.03433 | -8.79543 | 0.00000 |
| item_nbr95 | 0.57535 | 0.03488 | 16.49606 | 0.00000 |
| item_nbr96 | -1.06208 | 0.03482 | -30.50037 | 0.00000 |
| item_nbr97 | -1.04388 | 0.03482 | -29.97760 | 0.00000 |
| item_nbr98 | -0.52685 | 0.02976 | -17.70042 | 0.00000 |
| item_nbr99 | -0.38210 | 0.03429 | -11.14344 | 0.00000 |
| item_nbr100 | -0.28132 | 0.03438 | -8.18368 | 0.00000 |
| item_nbr101 | -0.96979 | 0.03482 | -27.84985 | 0.00000 |
| item_nbr102 | -0.71879 | 0.03408 | -21.09320 | 0.00000 |
| item_nbr103 | 1.57786 | 0.03573 | 44.15768 | 0.00000 |
| item_nbr104 | -0.71397 | 0.02759 | -25.88257 | 0.00000 |
| item_nbr105 | -0.63953 | 0.02695 | -23.73134 | 0.00000 |
| item_nbr106 | -0.72823 | 0.02991 | -24.34534 | 0.00000 |
| item_nbr107 | -0.54561 | 0.03572 | -15.27505 | 0.00000 |
| item_nbr108 | -0.21445 | 0.03410 | -6.28879 | 0.00000 |
| item_nbr109 | -0.02651 | 0.03005 | -0.88230 | 0.37762 |
| item_nbr110 | -0.42641 | 0.03431 | -12.42696 | 0.00000 |
| item_nbr111 | 1.54725 | 0.03573 | 43.30121 | 0.00000 |
| tmax | -0.00269 | 0.00038 | -7.12875 | 0.00000 |
| tmin | -0.00008 | 0.00042 | -0.20089 | 0.84078 |
| dewpoint | 0.00000 | 0.00028 | 0.01216 | 0.99029 |
| wetbulb | 0.00095 | 0.00048 | 1.96904 | 0.04895 |
| heat | -0.00014 | 0.00064 | -0.21569 | 0.82923 |
| cool | 0.00254 | 0.00067 | 3.76938 | 0.00016 |
| snowfall | -0.00183 | 0.00484 | -0.37761 | 0.70572 |
| preciptotal | 0.04391 | 0.00502 | 8.73988 | 0.00000 |
| stnpressure | -0.00088 | 0.01040 | -0.08459 | 0.93259 |
| sealevel | -0.03931 | 0.01291 | -3.04563 | 0.00232 |
| resultspeed | -0.00125 | 0.00076 | -1.64607 | 0.09975 |
| resultdir | -0.00015 | 0.00014 | -1.02775 | 0.30407 |
| avgspeed | 0.00121 | 0.00087 | 1.38559 | 0.16587 |
| month02 | -4.80778 | 0.21453 | -22.41113 | 0.00000 |
| month03 | -9.17597 | 0.41065 | -22.34514 | 0.00000 |
| month04 | -13.96321 | 0.62519 | -22.33438 | 0.00000 |
| month05 | -18.57670 | 0.83282 | -22.30578 | 0.00000 |
| month06 | -23.33453 | 1.04729 | -22.28087 | 0.00000 |
| month07 | -27.96589 | 1.25491 | -22.28518 | 0.00000 |
| month08 | -32.72793 | 1.46935 | -22.27372 | 0.00000 |
| month09 | -37.55732 | 1.68391 | -22.30367 | 0.00000 |
| month10 | -42.17783 | 1.89145 | -22.29925 | 0.00000 |
| month11 | -46.96653 | 2.10614 | -22.29982 | 0.00000 |
| month12 | -51.52611 | 2.31363 | -22.27071 | 0.00000 |
| year13 | -56.32006 | 2.52670 | -22.28996 | 0.00000 |
| year14 | -112.60552 | 5.05225 | -22.28820 | 0.00000 |
| day02 | -0.19638 | 0.01216 | -16.14451 | 0.00000 |
| day03 | -0.35054 | 0.01702 | -20.59418 | 0.00000 |
| day04 | -0.52692 | 0.02302 | -22.89080 | 0.00000 |
| day05 | -0.63025 | 0.02942 | -21.42230 | 0.00000 |
| day06 | -0.82694 | 0.03598 | -22.98173 | 0.00000 |
| day07 | -0.97482 | 0.04270 | -22.82743 | 0.00000 |
| day08 | -1.14364 | 0.04947 | -23.11574 | 0.00000 |
| day09 | -1.31273 | 0.05627 | -23.33107 | 0.00000 |
| day10 | -1.40344 | 0.06308 | -22.24713 | 0.00000 |
| day11 | -1.57385 | 0.06993 | -22.50538 | 0.00000 |
| day12 | -1.75806 | 0.07678 | -22.89618 | 0.00000 |
| day13 | -1.91101 | 0.08364 | -22.84933 | 0.00000 |
| day14 | -2.08284 | 0.09053 | -23.00690 | 0.00000 |
| day15 | -2.23627 | 0.09742 | -22.95583 | 0.00000 |
| day16 | -2.40891 | 0.10428 | -23.09943 | 0.00000 |
| day17 | -2.57709 | 0.11114 | -23.18859 | 0.00000 |
| day18 | -2.72382 | 0.11801 | -23.08066 | 0.00000 |
| day19 | -2.89539 | 0.12490 | -23.18132 | 0.00000 |
| day20 | -3.03859 | 0.13182 | -23.05117 | 0.00000 |
| day21 | -3.20127 | 0.13869 | -23.08149 | 0.00000 |
| day22 | -3.37318 | 0.14561 | -23.16652 | 0.00000 |
| day23 | -3.51796 | 0.15251 | -23.06643 | 0.00000 |
| day24 | -3.66229 | 0.15942 | -22.97329 | 0.00000 |
| day25 | -3.85695 | 0.16631 | -23.19070 | 0.00000 |
| day26 | -3.98035 | 0.17325 | -22.97491 | 0.00000 |
| day27 | -4.13572 | 0.18016 | -22.95534 | 0.00000 |
| day28 | -4.29873 | 0.18709 | -22.97739 | 0.00000 |
| day29 | -4.45766 | 0.19387 | -22.99341 | 0.00000 |
| day30 | -4.60301 | 0.20092 | -22.90950 | 0.00000 |
| day31 | -4.74245 | 0.20791 | -22.81030 | 0.00000 |
| weekdaysMonday | -0.09288 | 0.00475 | -19.55286 | 0.00000 |
| weekdaysTuesday | -0.15502 | 0.00476 | -32.59206 | 0.00000 |
| weekdaysWednesday | -0.18705 | 0.00475 | -39.38681 | 0.00000 |
| weekdaysThursday | -0.18067 | 0.00476 | -37.98422 | 0.00000 |
| weekdaysFriday | -0.11609 | 0.00476 | -24.39134 | 0.00000 |
| weekdaysSaturday | -0.02548 | 0.00478 | -5.33550 | 0.00000 |
| day_series | 0.15395 | 0.00692 | 22.25046 | 0.00000 |
| is_holiday | -0.08612 | 0.00633 | -13.60435 | 0.00000 |
From the summary, it can be observed that most of the weather features have high p-values, which further supports our hypothesis that weather attributes don't have a strong effect on sales. Using p-values as a reference, we decided to include tmax, preciptotal, and sealevel as predictors in our model.
Let’s take a look at the diagnostic plots:
1. Residuals vs. Fitted
The first diagnostic plot is used to identify non-linear patterns between residuals and fitted values. Looking at the plot, a specific pattern appears to be emerging; that is, the points are not randomly distributed along the line. Hence the linearity assumption doesn't hold in this case.
2. Normal Q-Q Plot
The Q-Q plot is used to test the normality assumption of residuals. It can be observed that the residuals are not normally distributed. We need to address this issue before we can fit a linear model.
3. Scale-Location Plot
This plot is used to test the homoscedasticity of the errors. A horizontal line with equally distributed points indicates homoscedasticity. Looking at the plot for our data, we see that the errors are not homoscedastic.
4. Residuals vs. Leverage Plot
This plot helps us identify influential points using Cook's distance. Since no points lie outside the Cook's distance threshold, we conclude that there are no influential points.
Our Kaggle score for the baseline model is \(0.51477\). It is very close to the all-zeros solution, as we have not done any feature selection and simply ran the model on the entire dataset. This results in predictions very close to \(0\), as the model is not sure of its predictions.
## store_nbr2 store_nbr3 store_nbr4 store_nbr5
## 5.151714e+00 6.079205e+00 6.692746e+00 4.097466e+00
## store_nbr6 store_nbr7 store_nbr8 store_nbr9
## 4.962212e+00 4.286724e+00 4.817230e+00 4.408312e+00
## store_nbr10 store_nbr11 store_nbr12 store_nbr13
## 3.187772e+00 4.237393e+00 3.419900e+00 3.783850e+00
## store_nbr14 store_nbr15 store_nbr16 store_nbr17
## 4.789995e+00 7.437912e+01 6.342243e+00 4.129591e+00
## store_nbr18 store_nbr19 store_nbr20 store_nbr21
## 4.207465e+00 5.154341e+00 6.582161e+00 3.744624e+00
## store_nbr22 store_nbr23 store_nbr24 store_nbr25
## 4.053919e+00 3.734261e+00 1.461720e+01 6.570373e+01
## store_nbr26 store_nbr27 store_nbr28 store_nbr29
## 2.700409e+00 3.513809e+00 4.706466e+00 4.186297e+00
## store_nbr30 store_nbr31 store_nbr32 store_nbr33
## 4.258567e+00 4.164823e+00 6.531201e+01 3.732290e+00
## store_nbr34 store_nbr35 store_nbr36 store_nbr37
## 4.196417e+00 5.339873e+00 4.053431e+00 6.588137e+01
## store_nbr38 store_nbr39 store_nbr40 store_nbr41
## 5.567301e+00 4.565187e+00 6.534262e+01 3.799511e+00
## store_nbr42 store_nbr43 store_nbr44 store_nbr45
## 3.843559e+00 4.139713e+00 4.028646e+00 5.621386e+00
## item_nbr2 item_nbr3 item_nbr4 item_nbr5
## 6.023229e+00 2.817508e+00 3.121994e+00 2.728956e+01
## item_nbr6 item_nbr7 item_nbr8 item_nbr9
## 5.064956e+00 2.974187e+00 3.112111e+00 2.466548e+01
## item_nbr10 item_nbr11 item_nbr12 item_nbr13
## 2.854424e+00 2.669674e+00 2.897008e+00 2.688382e+00
## item_nbr14 item_nbr15 item_nbr16 item_nbr17
## 2.948277e+00 9.216833e+00 1.273018e+01 2.856559e+00
## item_nbr18 item_nbr19 item_nbr20 item_nbr21
## 2.856891e+00 2.884092e+00 2.778083e+00 4.257525e+00
## item_nbr22 item_nbr23 item_nbr24 item_nbr25
## 3.121994e+00 4.525871e+00 2.658229e+00 4.249244e+00
## item_nbr26 item_nbr27 item_nbr28 item_nbr29
## 3.121994e+00 3.073090e+00 2.883941e+00 3.042676e+00
## item_nbr30 item_nbr31 item_nbr32 item_nbr33
## 1.991832e+00 2.810143e+00 2.778083e+00 2.801812e+00
## item_nbr34 item_nbr35 item_nbr36 item_nbr37
## 3.121994e+00 2.912818e+00 4.743633e+00 1.045551e+01
## item_nbr38 item_nbr39 item_nbr40 item_nbr41
## 2.928886e+00 4.176377e+00 2.883941e+00 4.586203e+00
## item_nbr42 item_nbr43 item_nbr44 item_nbr45
## 2.854424e+00 5.064956e+00 7.375485e+00 2.501503e+01
## item_nbr46 item_nbr47 item_nbr48 item_nbr49
## 2.865500e+00 2.883941e+00 3.096845e+00 5.701322e+00
## item_nbr50 item_nbr51 item_nbr52 item_nbr53
## 8.866756e+00 7.461361e+00 4.249170e+00 2.928886e+00
## item_nbr54 item_nbr55 item_nbr56 item_nbr57
## 2.763825e+00 2.892322e+00 4.761282e+00 3.159643e+00
## item_nbr58 item_nbr59 item_nbr60 item_nbr61
## 2.884092e+00 2.870615e+00 5.064956e+00 5.901067e+00
## item_nbr62 item_nbr63 item_nbr64 item_nbr65
## 2.764220e+00 2.658229e+00 2.656970e+00 2.688382e+00
## item_nbr66 item_nbr67 item_nbr68 item_nbr69
## 2.658229e+00 2.804649e+00 8.640702e+00 2.748699e+00
## item_nbr70 item_nbr71 item_nbr72 item_nbr73
## 2.801812e+00 3.096845e+00 3.159643e+00 2.817508e+00
## item_nbr74 item_nbr75 item_nbr76 item_nbr77
## 2.856891e+00 2.748699e+00 5.064956e+00 2.656970e+00
## item_nbr78 item_nbr79 item_nbr80 item_nbr81
## 2.763825e+00 2.809484e+00 2.789085e+00 2.897008e+00
## item_nbr82 item_nbr83 item_nbr84 item_nbr85
## 2.997667e+00 2.748699e+00 6.034325e+00 5.498893e+00
## item_nbr86 item_nbr87 item_nbr88 item_nbr89
## 1.178819e+01 2.791940e+00 4.323788e+00 2.883941e+00
## item_nbr90 item_nbr91 item_nbr92 item_nbr93
## 3.073090e+00 2.867841e+00 2.710865e+00 4.048164e+01
## item_nbr94 item_nbr95 item_nbr96 item_nbr97
## 2.764220e+00 2.974187e+00 3.073090e+00 6.121081e+00
## item_nbr98 item_nbr99 item_nbr100 item_nbr101
## 4.245740e+00 2.883941e+00 2.870615e+00 3.073090e+00
## item_nbr102 item_nbr103 item_nbr104 item_nbr105
## 2.747546e+00 3.313084e+00 7.506241e+00 1.033371e+01
## item_nbr106 item_nbr107 item_nbr108 item_nbr109
## 4.486464e+00 2.948277e+00 2.726780e+00 4.299551e+00
## item_nbr110 item_nbr111 tmax tmin
## 2.791940e+00 3.313084e+00 3.270896e+01 3.908446e+01
## dewpoint wetbulb heat cool
## 1.884055e+01 3.991413e+01 4.757838e+01 1.560751e+01
## snowfall preciptotal stnpressure sealevel
## 1.046240e+00 1.145537e+00 2.531872e+02 3.701896e+00
## resultspeed resultdir avgspeed month02
## 6.124752e+00 1.153738e+00 6.926234e+00 2.284637e+03
## month03 month04 month05 month06
## 9.095118e+03 1.962397e+04 3.414125e+04 5.162643e+04
## month07 month08 month09 month10
## 7.504148e+04 1.090469e+05 1.333416e+05 1.839436e+05
## month11 month12 year13 year14
## 1.585655e+05 1.954157e+05 8.948611e+05 3.162378e+06
## day02 day03 day04 day05
## 2.887618e+00 5.734152e+00 1.037695e+01 1.694440e+01
## day06 day07 day08 day09
## 2.594271e+01 3.628287e+01 4.849377e+01 6.264116e+01
## day10 day11 day12 day13
## 7.889220e+01 9.553245e+01 1.154619e+02 1.370900e+02
## day14 day15 day16 day17
## 1.585752e+02 1.827173e+02 2.114409e+02 2.459238e+02
## day18 day19 day20 day21
## 2.807471e+02 3.175667e+02 3.521279e+02 3.961864e+02
## day22 day23 day24 day25
## 4.406867e+02 4.802807e+02 5.257317e+02 5.516347e+02
## day26 day27 day28 day29
## 6.100958e+02 6.537563e+02 7.025292e+02 6.941209e+02
## day30 day31 weekdaysMonday weekdaysTuesday
## 7.125475e+02 4.940941e+02 1.736234e+00 1.723766e+00
## weekdaysWednesday weekdaysThursday weekdaysFriday weekdaysSaturday
## 1.731551e+00 1.731917e+00 1.736636e+00 1.740232e+00
## day_series is_holiday
## 2.672452e+06 1.144276e+00
The VIF table also shows that features such as \(day\_series\), \(stnpressure\), \(cool\), and \(heat\) have very high Variance Inflation Factor values.
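The VIF values above come from the fitted model; to make the numbers concrete, the same quantity can be computed by hand by regressing each predictor on the others, since \(VIF_j = 1/(1 - R_j^2)\). A base-R sketch on synthetic collinear data:

```r
# VIF by hand: regress each column of X on the remaining columns
vif_by_hand <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
}

set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # nearly collinear with x1
x3 <- rnorm(100)                  # independent noise
v  <- vif_by_hand(cbind(x1, x2, x3))
v  # x1 and x2 get huge VIFs; x3 stays near 1
```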
The Akaike information criterion (AIC) is an estimated measure of the quality of each candidate model: \(AIC = 2k - 2\ln(L)\), where \(k\) is the number of parameters in the model and \(L\) is the likelihood of the model. We decided to use the backward method, as the forward method takes much more time. The step function, which selects the model based on AIC, works as follows: excluding one variable at a time, it computes the AIC of the model without that particular variable; the variable whose removal yields the lowest AIC is then dropped. This is repeated until no removal improves the AIC.
From this table we can see that the model was reduced to 13 variables based on AIC. Variables such as tmin, stnpressure, dewpoint, snowfall, heat, avgspeed, and resultspeed were excluded from the model. Thus, we should consider removing some of these variables.
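The step function drives this search; a toy backward run on synthetic data (trace suppressed), to show the mechanics:

```r
# Synthetic data: a matters, b is pure noise
set.seed(3)
sim   <- data.frame(a = rnorm(100), b = rnorm(100))
sim$y <- 2 * sim$a + rnorm(100)

full    <- lm(y ~ a + b, data = sim)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # the selected model; AIC never exceeds the starting model's
```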
We decided to use LASSO regression instead of ridge regression because, along with regularization (variance reduction), LASSO performs variable selection by penalizing the absolute value of the coefficients. This shrinks the coefficients of less contributive variables exactly to zero.
From the above table for the LASSO Regression, we can see that the coefficients for tmin, dewpoint, heat, snowfall, resultspeed and avgspeed were set to zero. This indicates that these variables perform poorly in predicting the number of units purchased.
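The LASSO fit uses glmnet (loaded at the top of this report); a sketch on synthetic predictors, with cv.glmnet choosing the penalty. The column names reuse weather fields for flavor, but the data is random:

```r
library(glmnet)

set.seed(4)
X <- matrix(rnorm(100 * 5), 100, 5,
            dimnames = list(NULL, c("tmax", "preciptotal", "tmin",
                                    "dewpoint", "heat")))
# Only the first two columns actually drive the response
y <- 0.8 * X[, "tmax"] + 0.5 * X[, "preciptotal"] + rnorm(100)

cvfit <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 selects the LASSO penalty
coef(cvfit, s = "lambda.min")        # noise columns shrink toward/into zero
```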
To evaluate the importance of the categorical variables, we use the Tukey test. The Tukey test tells us whether there is a difference between the different levels of a variable and whether we should keep it. From Appendix 1, we can observe that the p-values for most of the pairwise comparisons of the store number variable are less than 0.05. Thus, we conclude that the store number variable is important, which is what we expected. The same analysis applies to the item number variable.
From the Appendix, we can see that the p-values for almost all pairwise comparisons of weekdays are less than 0.05. This indicates that the weekdays are significantly different from each other. The same procedure is applied to the month, year, and day variables, with similar results. Thus, we keep these four variables in the model.
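These pairwise comparisons come from TukeyHSD on an aov fit; a toy three-level factor illustrates the output shape:

```r
# Three groups: B and C shifted away from A
set.seed(5)
grp <- factor(rep(c("A", "B", "C"), each = 30))
y   <- c(rnorm(30, 0), rnorm(30, 2), rnorm(30, 2))

fit <- aov(y ~ grp)
TukeyHSD(fit)  # pairwise mean differences with adjusted p-values and CIs
```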
From the correlation table (shown in the Diagnostics section), we saw that resultspeed, heat, tmin, dewpoint and wetbulb are highly correlated with many other variables. The VIF results confirmed this. In addition, their p-values in the full model were high and LASSO regression set their coefficients to zero. For these reasons, we remove these variables from the full model. Based on the AIC, p-value and VIF analyses, the snowfall, stnpressure, resultdir and avgspeed variables are removed as well.
From the Data Analysis section, we observed that the month, day and weekdays variables follow a pattern that lets us treat them as continuous. We therefore convert them to numeric format, which speeds up the code and may improve the results.
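The conversion can be done with base R's `as.POSIXlt`, as in our code appendix; note its 0-based conventions.

```r
# as.POSIXlt exposes numeric date parts directly:
# $mon is 0-11 (January = 0), $mday is 1-31, $wday is 0-6 (Sunday = 0).
d <- as.POSIXlt("2013-07-04")
c(month = d$mon, day = d$mday, weekday = d$wday)  # July 4, 2013 was a Thursday
```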
One possible reason for the non-normal distribution of the errors and the non-constant variance is that our model underfits. Thus, we test whether adding interaction terms and higher-order terms helps. A natural interaction term is store number with maximum temperature, since the stores are located in different cities and therefore experience different maximum temperatures.
We use the Breusch-Pagan test to check whether the new model has constant variance. The null and alternative hypotheses are: \(H_0\): the errors have constant variance; \(H_1\): the errors are heteroskedastic.
##
## studentized Breusch-Pagan test
##
## data: lm(log(units + 1) ~ store_nbr * tmax + item_nbr + tmax + cool + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data)
## BP = 56010, df = 208, p-value < 2.2e-16
The p-value of the Breusch-Pagan test is less than alpha, so we reject the null hypothesis and conclude that the residuals are still heteroskedastic.
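The studentized statistic reported in the output above can be reproduced by hand: it is n times the R-squared of an auxiliary regression of the squared residuals on the predictors (this is what `lmtest::bptest` computes with its default `studentize = TRUE`). A base-R sketch on `mtcars`:

```r
# Studentized Breusch-Pagan by hand: regress squared residuals on the
# predictors; BP = n * R^2, chi-squared with p (here 2) degrees of freedom.
fit <- lm(mpg ~ wt + hp, data = mtcars)
aux <- lm(residuals(fit)^2 ~ wt + hp, data = mtcars)
bp  <- nrow(mtcars) * summary(aux)$r.squared
p_value <- pchisq(bp, df = 2, lower.tail = FALSE)
```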
Adding the interaction between store number and cool does not improve the situation either.
##
## studentized Breusch-Pagan test
##
## data: lm(log(units + 1) ~ store_nbr * cool + item_nbr + tmax + cool + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data)
## BP = 55437, df = 208, p-value < 2.2e-16
Adding second-order terms for tmax and preciptotal does not make the error variance constant either.
##
## studentized Breusch-Pagan test
##
## data: lm(log(units + 1) ~ store_nbr + item_nbr + I(tmax^2) + cool + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data)
## BP = 56109, df = 164, p-value < 2.2e-16
##
## studentized Breusch-Pagan test
##
## data: lm(log(units + 1) ~ store_nbr + item_nbr + I(preciptotal^2) + cool + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data)
## BP = 55429, df = 164, p-value < 2.2e-16
We do not add interaction or higher-degree terms involving store number and item number because, with so many levels, it is not feasible and would make our models run much slower.
Another possible reason for the non-constant error variance is that the errors are correlated or have unequal variances. That is why we try a generalized least squares (GLS) model to see whether it makes the error variance more constant. The GLS estimator in linear regression is a generalization of the ordinary least squares (OLS) estimator. There are cases where the OLS estimator is not BLUE (best linear unbiased estimator), mainly because the errors are heteroskedastic or correlated. In such situations, provided that the other BLUE assumptions hold, the GLS estimator can be used.
From the above plot, we see that the errors still have non-constant variance, so the GLS model did not help.
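As a sketch of what `gls` offers beyond `lm`, the error variance can be modeled as growing with the fitted mean. The `varPower()` structure and the `mtcars` data below are illustrative assumptions, not the report's actual call; `nlme` ships with R.

```r
library(nlme)

# An OLS-equivalent fit, then a GLS fit whose error variance is modeled
# as a power of the fitted values: Var(e_i) proportional to |fitted_i|^(2*delta).
ols_fit <- gls(mpg ~ wt + hp, data = mtcars)
gls_fit <- gls(mpg ~ wt + hp, data = mtcars, weights = varPower())
AIC(ols_fit, gls_fit)  # same fixed effects, so REML AICs are comparable
```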
Because the number of observations is very high, we could not run a generalized linear model to test whether a GLM could make the errors normally distributed: the glm function for our model crashes RStudio.
The paper we consulted discusses and shows that the t-test and least-squares linear regression do not require any assumption of normality in sufficiently large samples. As our dataset is quite large, we allow the errors to be non-normally distributed. That is why we use the following model as our best linear model.
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2.50902 | 0.23555 | 10.65159 | 0.00000 |
| store_nbr2 | 0.30478 | 0.01708 | 17.84468 | 0.00000 |
| store_nbr3 | 0.13313 | 0.01659 | 8.02275 | 0.00000 |
| store_nbr4 | 0.57716 | 0.01879 | 30.71056 | 0.00000 |
| store_nbr5 | 0.06524 | 0.01703 | 3.83093 | 0.00013 |
| store_nbr6 | 0.22417 | 0.01958 | 11.44851 | 0.00000 |
| store_nbr7 | 0.39878 | 0.01842 | 21.64466 | 0.00000 |
| store_nbr8 | -0.01990 | 0.01740 | -1.14375 | 0.25273 |
| store_nbr9 | 0.13071 | 0.01648 | 7.93181 | 0.00000 |
| store_nbr10 | 0.25437 | 0.01640 | 15.51220 | 0.00000 |
| store_nbr11 | 0.13875 | 0.01714 | 8.09714 | 0.00000 |
| store_nbr12 | 0.26998 | 0.01683 | 16.03853 | 0.00000 |
| store_nbr13 | 0.25536 | 0.01723 | 14.81813 | 0.00000 |
| store_nbr14 | -0.55497 | 0.01576 | -35.21837 | 0.00000 |
| store_nbr15 | 0.21634 | 0.01498 | 14.44432 | 0.00000 |
| store_nbr16 | 0.28424 | 0.01633 | 17.40057 | 0.00000 |
| store_nbr17 | 1.39428 | 0.01970 | 70.76909 | 0.00000 |
| store_nbr18 | -0.51890 | 0.01999 | -25.96097 | 0.00000 |
| store_nbr19 | -0.08344 | 0.01681 | -4.96493 | 0.00000 |
| store_nbr20 | 0.26496 | 0.01767 | 14.99899 | 0.00000 |
| store_nbr21 | -0.01304 | 0.01898 | -0.68710 | 0.49202 |
| store_nbr22 | 0.24034 | 0.01674 | 14.35617 | 0.00000 |
| store_nbr23 | -0.30740 | 0.01873 | -16.41119 | 0.00000 |
| store_nbr24 | 0.59424 | 0.03393 | 17.51189 | 0.00000 |
| store_nbr25 | 0.28505 | 0.01559 | 18.28833 | 0.00000 |
| store_nbr26 | -0.01239 | 0.01832 | -0.67643 | 0.49877 |
| store_nbr27 | 0.13493 | 0.01706 | 7.91086 | 0.00000 |
| store_nbr28 | -0.16724 | 0.01852 | -9.03124 | 0.00000 |
| store_nbr29 | -0.07321 | 0.02018 | -3.62767 | 0.00029 |
| store_nbr30 | 0.19881 | 0.01676 | 11.86486 | 0.00000 |
| store_nbr31 | 0.25443 | 0.01596 | 15.94088 | 0.00000 |
| store_nbr32 | 0.06413 | 0.01487 | 4.31324 | 0.00002 |
| store_nbr33 | 0.54416 | 0.01676 | 32.47460 | 0.00000 |
| store_nbr34 | 0.27725 | 0.01603 | 17.29652 | 0.00000 |
| store_nbr35 | 0.07822 | 0.01647 | 4.75033 | 0.00000 |
| store_nbr36 | -0.23543 | 0.01967 | -11.96608 | 0.00000 |
| store_nbr37 | 0.02503 | 0.01590 | 1.57435 | 0.11541 |
| store_nbr38 | 0.29687 | 0.01573 | 18.86825 | 0.00000 |
| store_nbr39 | -2.03146 | 0.02092 | -97.08759 | 0.00000 |
| store_nbr40 | 0.06272 | 0.01493 | 4.20189 | 0.00003 |
| store_nbr41 | 0.22499 | 0.01637 | 13.74162 | 0.00000 |
| store_nbr42 | -0.02294 | 0.01998 | -1.14777 | 0.25106 |
| store_nbr43 | 0.15492 | 0.01694 | 9.14321 | 0.00000 |
| store_nbr44 | 0.17720 | 0.01688 | 10.49785 | 0.00000 |
| store_nbr45 | -0.62300 | 0.01964 | -31.72033 | 0.00000 |
| item_nbr2 | -0.13913 | 0.02920 | -4.76507 | 0.00000 |
| item_nbr3 | -0.41859 | 0.03449 | -12.13592 | 0.00000 |
| item_nbr4 | 0.19276 | 0.03552 | 5.42663 | 0.00000 |
| item_nbr5 | 3.03823 | 0.02594 | 117.11037 | 0.00000 |
| item_nbr6 | 2.28120 | 0.04505 | 50.63168 | 0.00000 |
| item_nbr7 | -0.60533 | 0.03515 | -17.22092 | 0.00000 |
| item_nbr8 | 2.77569 | 0.03619 | 76.69562 | 0.00000 |
| item_nbr9 | 2.93084 | 0.02521 | 116.27037 | 0.00000 |
| item_nbr10 | -0.34975 | 0.03405 | -10.27075 | 0.00000 |
| item_nbr11 | -0.53993 | 0.03426 | -15.76166 | 0.00000 |
| item_nbr12 | -0.59794 | 0.03321 | -18.00620 | 0.00000 |
| item_nbr13 | -0.51168 | 0.03451 | -14.82546 | 0.00000 |
| item_nbr14 | -0.65384 | 0.03600 | -18.16265 | 0.00000 |
| item_nbr15 | -0.32652 | 0.02723 | -11.99067 | 0.00000 |
| item_nbr16 | 2.59867 | 0.02670 | 97.31951 | 0.00000 |
| item_nbr17 | -0.44682 | 0.03297 | -13.55027 | 0.00000 |
| item_nbr18 | -0.52259 | 0.03502 | -14.92240 | 0.00000 |
| item_nbr19 | -0.35230 | 0.03313 | -10.63287 | 0.00000 |
| item_nbr20 | -0.13591 | 0.03469 | -3.91811 | 0.00009 |
| item_nbr21 | -0.29554 | 0.03034 | -9.74148 | 0.00000 |
| item_nbr22 | 0.08940 | 0.03552 | 2.51675 | 0.01184 |
| item_nbr23 | 2.04034 | 0.03077 | 66.30703 | 0.00000 |
| item_nbr24 | 0.03815 | 0.03466 | 1.10075 | 0.27101 |
| item_nbr25 | 3.18516 | 0.03012 | 105.75983 | 0.00000 |
| item_nbr26 | 1.31514 | 0.03552 | 37.02478 | 0.00000 |
| item_nbr27 | 1.16641 | 0.03509 | 33.23627 | 0.00000 |
| item_nbr28 | 1.05815 | 0.03456 | 30.61971 | 0.00000 |
| item_nbr29 | 0.18132 | 0.03516 | 5.15728 | 0.00000 |
| item_nbr30 | 0.22079 | 0.02820 | 7.83071 | 0.00000 |
| item_nbr31 | -0.55337 | 0.03379 | -16.37752 | 0.00000 |
| item_nbr32 | -0.34188 | 0.03469 | -9.85552 | 0.00000 |
| item_nbr33 | 0.14312 | 0.03365 | 4.25319 | 0.00002 |
| item_nbr34 | 0.05270 | 0.03552 | 1.48371 | 0.13789 |
| item_nbr35 | -0.00690 | 0.03536 | -0.19526 | 0.84519 |
| item_nbr36 | 3.58904 | 0.03098 | 115.84016 | 0.00000 |
| item_nbr37 | 2.00968 | 0.02735 | 73.48422 | 0.00000 |
| item_nbr38 | -0.08033 | 0.03339 | -2.40587 | 0.01613 |
| item_nbr39 | -0.62877 | 0.03048 | -20.62689 | 0.00000 |
| item_nbr40 | -0.49554 | 0.03456 | -14.33936 | 0.00000 |
| item_nbr41 | 2.50208 | 0.03128 | 79.99047 | 0.00000 |
| item_nbr42 | 0.39355 | 0.03405 | 11.55683 | 0.00000 |
| item_nbr43 | 2.56430 | 0.04505 | 56.91526 | 0.00000 |
| item_nbr44 | 3.93058 | 0.02521 | 155.93160 | 0.00000 |
| item_nbr45 | 3.43035 | 0.02602 | 131.81004 | 0.00000 |
| item_nbr46 | -0.49770 | 0.03450 | -14.42513 | 0.00000 |
| item_nbr47 | 0.03062 | 0.03456 | 0.88603 | 0.37560 |
| item_nbr48 | 0.58517 | 0.03562 | 16.42804 | 0.00000 |
| item_nbr49 | -0.67579 | 0.02845 | -23.75069 | 0.00000 |
| item_nbr50 | -0.34116 | 0.02738 | -12.46061 | 0.00000 |
| item_nbr51 | -0.51514 | 0.02759 | -18.67430 | 0.00000 |
| item_nbr52 | -0.31619 | 0.03001 | -10.53635 | 0.00000 |
| item_nbr53 | -0.32462 | 0.03339 | -9.72231 | 0.00000 |
| item_nbr54 | -0.54320 | 0.03441 | -15.78746 | 0.00000 |
| item_nbr55 | -0.23454 | 0.03318 | -7.06872 | 0.00000 |
| item_nbr56 | -0.02984 | 0.03143 | -0.94929 | 0.34247 |
| item_nbr57 | 0.16612 | 0.03583 | 4.63664 | 0.00000 |
| item_nbr58 | -0.25545 | 0.03313 | -7.70966 | 0.00000 |
| item_nbr59 | -0.08262 | 0.03465 | -2.38485 | 0.01709 |
| item_nbr60 | -0.79699 | 0.04505 | -17.68929 | 0.00000 |
| item_nbr61 | -0.43536 | 0.02860 | -15.22091 | 0.00000 |
| item_nbr62 | -0.45956 | 0.03460 | -13.28137 | 0.00000 |
| item_nbr63 | -0.07799 | 0.03466 | -2.25031 | 0.02443 |
| item_nbr64 | -0.59742 | 0.03453 | -17.30159 | 0.00000 |
| item_nbr65 | -0.49318 | 0.03451 | -14.28948 | 0.00000 |
| item_nbr66 | 2.10128 | 0.03466 | 60.62678 | 0.00000 |
| item_nbr67 | 0.66335 | 0.03376 | 19.65175 | 0.00000 |
| item_nbr68 | 2.75143 | 0.02748 | 100.13418 | 0.00000 |
| item_nbr69 | -0.00262 | 0.03470 | -0.07540 | 0.93990 |
| item_nbr70 | 0.17486 | 0.03365 | 5.19643 | 0.00000 |
| item_nbr71 | -1.73917 | 0.03562 | -48.82544 | 0.00000 |
| item_nbr72 | 0.22679 | 0.03583 | 6.33000 | 0.00000 |
| item_nbr73 | -0.47817 | 0.03449 | -13.86341 | 0.00000 |
| item_nbr74 | -0.47015 | 0.03502 | -13.42494 | 0.00000 |
| item_nbr75 | -0.24821 | 0.03470 | -7.15294 | 0.00000 |
| item_nbr76 | -1.04509 | 0.04505 | -23.19608 | 0.00000 |
| item_nbr77 | -0.51610 | 0.03453 | -14.94633 | 0.00000 |
| item_nbr78 | -0.51037 | 0.03441 | -14.83340 | 0.00000 |
| item_nbr79 | -0.49735 | 0.03444 | -14.44002 | 0.00000 |
| item_nbr80 | -0.49990 | 0.03456 | -14.46299 | 0.00000 |
| item_nbr81 | 0.03478 | 0.03321 | 1.04727 | 0.29498 |
| item_nbr82 | -0.28996 | 0.03630 | -7.98814 | 0.00000 |
| item_nbr83 | 1.70898 | 0.03470 | 49.25022 | 0.00000 |
| item_nbr84 | -0.74336 | 0.02852 | -26.06520 | 0.00000 |
| item_nbr85 | -0.94493 | 0.02838 | -33.29704 | 0.00000 |
| item_nbr86 | -0.74029 | 0.02677 | -27.65573 | 0.00000 |
| item_nbr87 | -0.20620 | 0.03458 | -5.96268 | 0.00000 |
| item_nbr88 | -0.23080 | 0.03021 | -7.64068 | 0.00000 |
| item_nbr89 | -0.53845 | 0.03456 | -15.58114 | 0.00000 |
| item_nbr90 | -0.62211 | 0.03509 | -17.72678 | 0.00000 |
| item_nbr91 | -0.26371 | 0.03474 | -7.59058 | 0.00000 |
| item_nbr92 | -0.77514 | 0.03378 | -22.94861 | 0.00000 |
| item_nbr93 | -0.50973 | 0.02559 | -19.91917 | 0.00000 |
| item_nbr94 | -0.30198 | 0.03460 | -8.72710 | 0.00000 |
| item_nbr95 | 0.57535 | 0.03515 | 16.36791 | 0.00000 |
| item_nbr96 | -1.06208 | 0.03509 | -30.26341 | 0.00000 |
| item_nbr97 | -1.04388 | 0.03509 | -29.74470 | 0.00000 |
| item_nbr98 | -0.52685 | 0.03000 | -17.56290 | 0.00000 |
| item_nbr99 | -0.38210 | 0.03456 | -11.05687 | 0.00000 |
| item_nbr100 | -0.28132 | 0.03465 | -8.12010 | 0.00000 |
| item_nbr101 | -0.96979 | 0.03509 | -27.63349 | 0.00000 |
| item_nbr102 | -0.71879 | 0.03434 | -20.92933 | 0.00000 |
| item_nbr103 | 1.57786 | 0.03601 | 43.81462 | 0.00000 |
| item_nbr104 | -0.71397 | 0.02780 | -25.68149 | 0.00000 |
| item_nbr105 | -0.63953 | 0.02716 | -23.54697 | 0.00000 |
| item_nbr106 | -0.72823 | 0.03015 | -24.15620 | 0.00000 |
| item_nbr107 | -0.54561 | 0.03600 | -15.15638 | 0.00000 |
| item_nbr108 | -0.21445 | 0.03437 | -6.23993 | 0.00000 |
| item_nbr109 | -0.02651 | 0.03029 | -0.87544 | 0.38133 |
| item_nbr110 | -0.42641 | 0.03458 | -12.33041 | 0.00000 |
| item_nbr111 | 1.54725 | 0.03601 | 42.96480 | 0.00000 |
| month | -0.00592 | 0.00040 | -14.76957 | 0.00000 |
| weekdays | -0.00508 | 0.00065 | -7.87230 | 0.00000 |
| day | -0.00453 | 0.00015 | -31.12710 | 0.00000 |
| year13 | -0.10049 | 0.00303 | -33.19556 | 0.00000 |
| year14 | -0.19283 | 0.00323 | -59.74351 | 0.00000 |
| tmax | -0.00231 | 0.00011 | -20.63336 | 0.00000 |
| cool | 0.00382 | 0.00026 | 14.50828 | 0.00000 |
| preciptotal | 0.05127 | 0.00484 | 10.59565 | 0.00000 |
| sealevel | -0.05174 | 0.00773 | -6.69708 | 0.00000 |
| is_holiday | -0.07720 | 0.00606 | -12.73261 | 0.00000 |
The p-value for this model is less than 0.05 and the adjusted \(R^2\) is 0.8760862. Moreover, the p-values for almost all variables are significant, indicating that we keep only important variables. The Kaggle score for this model is 0.13302.
In the advanced analysis we use two very different machine learning models.
Decision trees (DT) belong to the family of non-parametric supervised learning models and are used for both classification and regression. A decision tree builds its logic in the form of a tree, breaking the dataset down into smaller chunks while the tree is developed.
The process of subsetting the data into smaller chunks begins with splitting: splits are formed on particular variables. Pruning is the shortening of branches of the tree; it turns some branches into leaf nodes and removes the leaves under the original branch. Pruning is needed to avoid overfitting, which is less of a concern in our case because we have a large dataset with relatively few parameters.
The question, then, is which tree is best. There can be multiple features and hence multiple candidate splits. In theory, every feature is tried as a split, and the feature that yields the highest purity is chosen. Entropy is inversely proportional to purity: if a system is completely pure, that is, all points in it have the same class or value, its entropy is zero. The objective of tree models is therefore to reduce the entropy of the tree.
In brief, a DT is like a flowchart: terminal (leaf) nodes represent classification or decision values, and inner nodes hold the logic that branches the tree in multiple directions.
We used DT for multiple reasons:

- The data we have is very similar to time-series data. Although we do not use time-series modeling techniques like ARIMA, we wanted to try models that perform well on time-related data, so we experiment with the forecasting power of DT models.
- From the diagnostics charts, we saw that the Walmart data does not fulfill the assumptions of a linear model. Decision tree models require no such assumptions and work on data with different distributions.
- It is computationally efficient.
One of the best parts of tree-based models is that they do not require any fancy data preprocessing, so we fed the DT the same data we used for linear regression, with no processing beyond the basic feature engineering done for the linear models.
From the graph we can see that as the size of the tree increases, the cross-validation error (xerror) decreases, but it saturates beyond some value of \(cp\). \(cp\) is the complexity parameter: as \(cp\) decreases, the size of the tree increases. We find that a tree of size 22 performs best. The Kaggle score of this model comes out to \(0.15\), similar to our linear model, which implies that we need the more robust models we explore in the next section.
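The cp/xerror tradeoff shown in the graph comes from rpart's cptable; a sketch on `mtcars` (illustrative, not the report's fitted model):

```r
library(rpart)

# Grow a deliberately large tree (small cp), then inspect the
# cross-validated error per tree size; plotcp(tree) plots this table.
set.seed(42)  # xerror comes from internal cross-validation
tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(cp = 0.001, minsplit = 5, xval = 5))
tree$cptable  # columns: CP, nsplit, rel error, xerror, xstd
```

Pruning back to the cp with the lowest xerror gives the best-generalizing tree size.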
XGBoost comes from the family of boosting models; the name stands for eXtreme Gradient Boosting. XGB is an ensemble method that combines multiple weak learners and improves their performance iteratively: models are built sequentially, each one minimizing the errors of the previous weak learners while boosting the performance of the high-performing ones. Gradient boosting applies the gradient descent algorithm to find minima and reduce errors across the sequential models, and it uses tree-based models to form the ensemble. Boosting resamples subsets of the data with replacement, so one data point can appear in multiple sequentially built weak learners. Gradient descent itself is an optimization algorithm that minimizes a function by iteratively moving opposite to the direction of its gradient; it is the core optimization technique in most deep learning algorithms, which, unlike linear regression, have no closed-form solution.
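The sequential residual-fitting idea can be made concrete with a hand-rolled boosting loop. This is a sketch with depth-1 rpart stumps and squared-error loss, where the negative gradient is simply the residual; xgboost adds regularization and second-order information on top of this scheme.

```r
library(rpart)

# Minimal gradient boosting: each round fits a stump to the current
# residuals and adds a shrunken (eta) copy of its predictions.
boost <- function(x, y, rounds = 50, eta = 0.1) {
  pred <- rep(mean(y), length(y))  # learner 0: the mean
  for (i in seq_len(rounds)) {
    r <- y - pred                  # negative gradient of squared error
    stump <- rpart(r ~ x,
                   control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))
    pred <- pred + eta * predict(stump)
  }
  pred
}

fit <- boost(mtcars$wt, mtcars$mpg)
mean((mtcars$mpg - fit)^2)  # lower than the baseline variance of mpg
```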
We did not give the data any special treatment and used the same format as for the linear regression models. Although we removed missing data in preprocessing, ensemble methods like XGBoost do not strictly require this step.
We used 2-fold cross-validation and found the best \(RMSE\) to be \(0.52\). Our Kaggle score comes out to \(0.10951\), noticeably better than the other models used in this analysis. As with rpart, the RMSE decreases as tree depth increases.
# library(dplyr)
# library(MASS)
# library(imputeTS)
# library(rpart)
# library(glmnet)
# library(zoo)
# library(formattable)
# library(kableExtra)
# library(chron)
# library(tidyverse)
# library(caret)
# library(rpart.plot)
# library(randomForest)
# library(lmtest)
# library(gridExtra)
# library(xgboost)
# library(faraway)
# library(nlme)
# library(corrplot)
# library(imputeTS)
# key_data <- read.csv("data/key.csv")
# train_data <- read.csv("data/train.csv")
# weather_data <- read.csv("data/weather.csv")
# test_data <- read.csv("data/test.csv")
# library("dplyr")
# library("ggplot2")
# sales_stats <- train_data %>% mutate(store_item_combo = paste(.$store_nbr,"-",.$item_nbr)) %>%
# group_by(store_item_combo) %>%
# summarize(total_items = sum(units)) %>%
# mutate(sales_flag = case_when(
# .$total_items == 0 ~ 'zero-sales',
# TRUE ~ 'non-zero sales'
# )
# ) %>%
# group_by(sales_flag) %>%
# count()
# ggplot(data=sales_stats, aes(x=sales_flag, y=n)) +
# geom_bar(stat="identity", width=0.5) + labs(x= "Store Item Combination", y = "Units Sold")
# store_items_cnt <- train_data %>%
# group_by(item_nbr, store_nbr) %>%
# summarize(total_items = sum(units))
# train_data <- merge(train_data, store_items_cnt[store_items_cnt$total_items != 0, ], by = c("item_nbr", "store_nbr"))
# train_data <- dplyr::select(merge(merge(train_data, key_data, by = "store_nbr"), weather_data, by = c("date", "station_nbr")),
# -c("total_items")
# )
# test_key_data <- merge(test_data, key_data, by = c("store_nbr"))
# test_data <- merge(test_key_data, weather_data, by = c("station_nbr", "date"))
# numerical_features <- c("tmax", "tmin", "dewpoint", "wetbulb", "heat", "cool", "stnpressure",
# "sealevel", "resultspeed", "resultdir", "avgspeed")
# trace_features <- c("snowfall", "preciptotal")
# factor_variables <- c("station_nbr", "item_nbr", "store_nbr")
# dropped_features <- c("codesum", "sunrise", "sunset", "depart", "tavg", "season", "isWeekend")
# holidays <- read.table('data/holidays.txt', sep = " ", fill = TRUE) %>%
# filter(str_detect(V1, '^20')) %>%
# mutate(date_holiday = str_c(V1, "-", V2, "-", V3), format = "%Y-%b-%d")
# holidays <- Reduce(c, lapply(holidays$date_holiday,FUN=function(x) as.Date(x, format = "%Y-%b-%d")))
# check_holiday <- function(x) {
#   if (any(holidays == x)) {
#     return(TRUE)   # early return, otherwise the function always fell through to FALSE
#   }
#   FALSE
# }
# create_features = function(df, is_train=TRUE, drop_columns=TRUE){
# new_date <- as.Date(df$date, "%Y-%m-%d")
# df$month <- as.factor(strftime(new_date, "%m"))
# df$year <- as.factor(strftime(new_date, "%y"))
# df$day <- as.factor(strftime(new_date, "%d"))
# df$weekdays <- as.factor(weekdays(as.Date(df$date, "%Y-%m-%d")))
# df$weekdays <- factor(df$weekdays, levels= c("Sunday", "Monday",
# "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# df$is_weekend <- as.factor(df$weekdays %in% c("Sunday", "Saturday"))
# min_date <- min(as.Date(df$date),na.rm=TRUE)
# df$day_series <- as.numeric(as.Date(df$date)-min_date)
# df$is_holiday <- sapply(as.character(df$date), FUN = function(x) ifelse(any(as.character(holidays) == x), 1, 0))
# # train_data$holiday_info <- sapply(train_data$date, check_holiday)
# yq <- as.yearqtr(as.yearmon(df$date, "%Y-%m-%d") + 1/12)
# df$season <- as.factor(factor(format(yq, "%q"), levels = 1:4, labels = c("winter", "spring", "summer", "fall")))
# df$date <- as.factor(df$date)
# if(drop_columns == TRUE){
# df <- df[, !colnames(df) %in% dropped_features]
# }
# df[numerical_features] <- sapply(df[numerical_features], as.character)
# df[trace_features] <- sapply(df[trace_features], as.character)
# df[df$preciptotal == " T", ]$preciptotal <- "0.05"
# df[df$snowfall == " T", ]$snowfall <- "0.1"
# df[df == "M"] <- NA
# df[numerical_features] <- sapply(df[numerical_features], as.numeric)
# df[trace_features] <- sapply(df[trace_features], as.numeric)
# if(is_train == TRUE){
# df$units <- as.numeric(df$units)
# }
# for(i in numerical_features){
# df[is.na(df[,i]), i] <- median(df[,i], na.rm = TRUE)
# }
# for(i in trace_features){
# df[is.na(df[,i]), i] <- median(df[,i], na.rm = TRUE)
# }
# df[factor_variables] <- lapply(df[factor_variables], factor)
# df
# }
# train_data <- create_features(train_data)
# test_data <- create_features(test_data, is_train = FALSE)
# library(ggplot2)
# #Snowfall vs items sold
# IQR_snowfall=IQR(train_data$snowfall)
# items_sold_snowfall <- train_data[train_data$snowfall < 1,] %>%
# group_by(snowfall, year) %>%
# summarise(items_sold = sum(units))
# p1=ggplot(items_sold_snowfall, aes(x = snowfall, y = items_sold,group = year, colour = factor(year))) +
# geom_line() +
# geom_point() +
# labs(x= "snowfall", y = "Units Sold", colour = "Year") +
# theme_classic()
# #Precipitation vs items sold
# IQR_preciptotal=IQR(train_data$preciptotal)
# items_sold_precip <- train_data[train_data$preciptotal < 1.5*IQR_preciptotal,] %>%
# group_by(preciptotal, year) %>%
# summarise(items_sold = sum(units))
# p2 = ggplot(items_sold_precip, aes(x = preciptotal, y = items_sold, group = year, colour = factor(year))) +
# geom_line() +
# geom_point() +
# labs(x= "preciptotal", y = "Units Sold", colour = "Year") +
# #scale_x_discrete(limits = weekdays(as.Date(4,"1970-01-01",tz="GMT")+0:6)) +
# theme_classic()
# #tmax vs items sold
# items_sold_by_temperature <- train_data %>%
# group_by(tmax, year) %>%
# summarise(items_sold = sum(units))
# p3 = ggplot(items_sold_by_temperature, aes(x =tmax, y = items_sold, group = year, colour = factor(year))) +
# geom_line() +
# geom_point() +
# labs(x= "tmax", y = "Units Sold", colour = "Year") +
# theme_classic()
# #sealevel vs items sold
# items_sold_by_sealevel <- train_data %>%
# group_by(sealevel, year) %>%
# summarise(items_sold = sum(units))
# p4 = ggplot(items_sold_by_sealevel, aes(x =sealevel, y = items_sold, group = year, colour = factor(year))) +
# geom_line() +
# geom_point() +
# labs(x= "sealevel", y = "Units Sold", colour = "Year") +
# theme_classic()
# grid.arrange(p1,p2,p3,p4)
# col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
# corrplot(cor(train_data[c(numerical_features,trace_features,'units')]), method = "color", col = col(200),
# type = "upper", order = "hclust", number.cex = .7,
# addCoef.col = "black", # Add coefficient of correlation
# tl.col = "black", tl.srt = 90, # Text label color and rotation
# # Combine with significance
# p.mat = cor.mtest(train_data[c(numerical_features,trace_features,'units')])$p, sig.level = 0.01, insig = "blank",
# # hide correlation coefficient on the principal diagonal
# diag = FALSE)
# baseline_model <- lm(log(units + 1) ~ . -date -station_nbr -is_weekend, train_data)
# knitr::kable(summary(baseline_model)$coef, digits=5) %>%
# kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
# plot(baseline_model)
# vif(baseline_model)
# aic_b <- step(baseline_model, trace = FALSE)
# X <- model.matrix(baseline_model)
# lasso_baseline_cv <- cv.glmnet(X[, -1], log(train_data$units + 1), family="gaussian", alpha = 1)
# lasso_baseline <- glmnet(X[, -1], log(train_data$units + 1), family="gaussian", alpha = 1, lambda = lasso_baseline_cv$lambda.min)
# aic_b <- step(baseline_model)
# TukeyHSD(aov(units ~ store_nbr, train_data))
# TukeyHSD(aov(units ~ weekdays, train_data))
# TukeyHSD(aov(units ~ month, train_data))
# TukeyHSD(aov(units ~ day, train_data))
# TukeyHSD(aov(units ~ year, train_data))
# train_data$month <- as.POSIXlt(train_data$date)$mon
# train_data$day <- as.POSIXlt(train_data$date)$mday
# train_data$weekdays <- as.POSIXlt(train_data$date)$wday
# test_data$month <- as.POSIXlt(test_data$date)$mon
# test_data$day <- as.POSIXlt(test_data$date)$mday
# test_data$weekdays <- as.POSIXlt(test_data$date)$wday
# bptest(lm(log(units + 1) ~ store_nbr*tmax + item_nbr + tmax + cool + sealevel+ preciptotal + month + year + day + weekdays + is_holiday, train_data))
# bptest(lm(log(units + 1) ~ store_nbr*cool + item_nbr + tmax + cool + sealevel+ preciptotal + month + year + day + weekdays + is_holiday, train_data))
# bptest(lm(log(units + 1) ~ store_nbr+ item_nbr + I(tmax^2) + cool + sealevel + preciptotal + month + year + day + weekdays+ is_holiday, train_data))
# bptest(lm(log(units + 1) ~ store_nbr+ item_nbr + I(preciptotal^2) + cool + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data))
# gls_model <- gls(log(units + 1) ~ store_nbr + item_nbr + tmax + cool + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data)
# plot(gls_model)
# sample_mod <- lm(log(units + 1) ~ store_nbr + item_nbr + month + weekdays + day + year + tmax + cool + preciptotal + sealevel + is_holiday, train_data)
# knitr::kable(summary(sample_mod)$coef, digits=5) %>%
# kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
# test_data$date <- as.factor(test_data$date)
# test_data$predictions <- predict(rpart_mod, test_data)
# test_data$predictions <- exp(test_data$predictions) - 1
# test_data$predictions[which(test_data$predictions < 0)] <- 0
# temp <- store_items_cnt[store_items_cnt$total_items > 0 ,]
# merged = merge(test_data, temp, by.x = c("store_nbr", "item_nbr"), by.y = c("store_nbr", "item_nbr"), all.x = TRUE)
# merged[which(is.na(merged$total_items)), "predictions"] = 0
# id <- paste(merged$store_nbr, merged$item_nbr, merged$date, sep = "_")
# write.table(cbind(id, merged$predictions),file="data/results.csv",sep =",", row.names=F,col.names=c('id','units'))
# rpart_mod <- rpart(log(units + 1) ~ store_nbr + item_nbr + month + weekdays + day + year + tmax + cool + preciptotal + sealevel + is_holiday, train_data, cp = 0.001)
# set.seed(42)
# ctrl <- trainControl(method = "cv", number = 2, allowParallel = TRUE, verboseIter = FALSE, returnData = FALSE)
# xgb_mod <- caret::train(
# log(units + 1) ~ store_nbr + item_nbr + month + weekdays + day + year + tmax + cool + preciptotal + sealevel + is_holiday,
# train_data,
# method = "xgbTree",
# trControl = ctrl
# )
# plotcp(rpart_mod)